The dataset was presented by Cortez et al. (see the reference below). It contains about 5,000 white wines, each described by 11 variables that quantify its chemical properties. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
The raw dataset contains 13 variables for 4898 observations: an index variable called ‘X’ (not listed in the output above, which shows the remaining 12), ‘quality’ for the expert quality rating, and 11 chemical features of the white wines. The quality variable is an integer with a minimum of 3, a maximum of 9, a median of 6 and a mean of 5.878. All the chemical feature variables are floating-point numbers. They are measured in different units and therefore span widely different ranges. For example, the chlorides variable has a small range from 0.009 to 0.346, while the total.sulfur.dioxide variable has a large range from 9.0 to 440.0. Some features, such as residual sugar and chlorides, contain outliers.
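To make the point about differing ranges concrete, here is a minimal sketch that uses only the first ten values of three columns, copied from the str() output above. The data frame name `wine_head` is my own choice for illustration, and ten rows are of course not representative of the full dataset.

```r
# First ten rows of three columns, copied from the str() output above.
wine_head <- data.frame(
  chlorides            = c(0.045, 0.049, 0.05, 0.058, 0.058,
                           0.05, 0.045, 0.045, 0.049, 0.044),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

# Min and max of every column: a 2 x 3 matrix, one column per variable.
ranges <- sapply(wine_head, range)
rownames(ranges) <- c("min", "max")
print(ranges)

# Standardising each column (mean 0, sd 1) puts them on a comparable scale.
wine_scaled <- as.data.frame(scale(wine_head))
print(round(sapply(wine_scaled, sd), 3))
```

Standardising like this is one option when variables with very different units have to be compared or fed into the same model; the plots in this report use the original units.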
The features that interest me most are quality and alcohol. I expected that alcohol, combined with some of the other variables, could be used to build a predictive model of wine quality.
Features such as residual sugar, citric acid, chlorides and pH may also help.
I created an ordered factor for quality from its original integer value. Furthermore, I grouped the wine quality into 3 buckets [(3,4,5), (6), (7,8,9)] so that each bucket contains more samples for better analysis.
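The two transformations described above can be sketched as follows. The toy `quality` vector and the bucket labels "low", "medium" and "high" are assumptions for illustration; the report does not say which labels or which function were actually used.

```r
# Toy quality scores (NOT the real data), one or two per level.
quality <- c(3L, 4L, 5L, 6L, 6L, 7L, 8L, 9L)

# Ordered factor: the levels keep their natural order 3 < 4 < ... < 9.
quality_ord <- factor(quality, levels = 3:9, ordered = TRUE)

# Three buckets (3,4,5), (6), (7,8,9). cut() uses right-closed intervals
# by default, so breaks 2, 5, 6, 9 give (2,5], (5,6], (6,9].
quality_bucket <- cut(quality,
                      breaks = c(2, 5, 6, 9),
                      labels = c("low", "medium", "high"),
                      ordered_result = TRUE)

print(table(quality_bucket))
```

Keeping the factor ordered means that comparisons such as `quality_ord[1] < quality_ord[8]` are meaningful and that plotting functions preserve the level order.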
I made histograms for quality, density, citric acid, chlorides, pH, fixed acidity, sulphates and free sulfur dioxide. Outliers were removed.
After trying different bin sizes, I found that density is slightly positively skewed. The others are approximately normally distributed, except for some outliers such as the long tail in chlorides, which has been removed.
In this plot I draw the histogram and density of alcohol level. The binwidth of the histogram is set to 0.1, and the density is estimated with a Gaussian kernel with the default adjust=1. From the plot we see that the alcohol level in the sample set is positively skewed. More specifically, there are more wines with a lower alcohol level (9 to 10) than with a higher alcohol level (11 to 12).
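A base-R sketch of the same kind of histogram and kernel density estimate is given below. It runs on synthetic, right-skewed values rather than the real alcohol column (the gamma parameters are arbitrary assumptions), and it uses base graphics, which may differ from the plotting code behind the figure. The summary above already hints at the skew: the mean alcohol level (10.51) is larger than the median (10.40).

```r
set.seed(123)
# Synthetic right-skewed stand-in for the alcohol column (NOT real data).
alcohol <- 8 + rgamma(1000, shape = 2, scale = 1.25)

# Histogram with a bin width of 0.1.
breaks <- seq(floor(min(alcohol)), ceiling(max(alcohol)), by = 0.1)
h <- hist(alcohol, breaks = breaks, plot = FALSE)

# Gaussian kernel density estimate with the default bandwidth multiplier.
d <- density(alcohol, kernel = "gaussian", adjust = 1)

# Moment-based sample skewness; a positive value means a longer right tail.
skewness <- mean((alcohol - mean(alcohol))^3) / sd(alcohol)^3
print(skewness)

# Draw the plot only in an interactive session.
if (interactive()) {
  plot(h, freq = FALSE, main = "Alcohol level", xlab = "alcohol")
  lines(d)
}
```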
Groups of box plots were made for density, pH, citric acid and alcohol level at each quality level. I found a clear dependency between alcohol and quality: the alcohol level tends to be high for both low-quality and high-quality wines, but low for medium-quality wines. This is a very interesting observation.
## Correlation: 0.4355747
We can see that the highest-quality wine (9) has a quite concentrated alcohol level; in other words, the variance of alcohol level for wine of this quality is low. Later I realized that this is because there are very few samples (5 in total) with a quality score of 9, so the small variance could partly be attributed to the lack of data.
I made this plot to better show the quality of wine versus the alcohol level. It is a scatter plot with alpha=0.5 plus some jittering to visualize the actual distribution of alcohol at each quality level. In addition, the 10%, median and 90% quantile bars and boxplot geoms were added to better visualize the general trend of the data. From the exploration above, alcohol is the feature with the largest correlation (0.435) to wine quality among all the given features. For wine samples of quality 5 or higher, the quality gets better as the median alcohol level grows. However, low-quality wines (3 and 4) also tend to have a higher alcohol level. This observation is very interesting to me; it also indicates that other variables must influence the quality.
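The correlation and the 10%, median and 90% bars per quality level can be computed along the following lines. This is a minimal sketch on synthetic data, not the real wine measurements: the data frame `wine` and the formula that generates it are assumptions made only so that the example runs on its own.

```r
set.seed(1)
# Synthetic stand-in for the wine data (NOT the real measurements).
n <- 500
alcohol <- runif(n, 8, 14)
raw_score <- as.integer(round(3 + 0.5 * (alcohol - 8) + rnorm(n)))
quality <- pmin(9L, pmax(3L, raw_score))   # clamp scores to 3..9
wine <- data.frame(alcohol = alcohol, quality = quality)

# Pearson correlation between the integer score and alcohol.
r <- cor(wine$quality, wine$alcohol)
print(r)

# 10%, 50% and 90% quantiles of alcohol within each quality level.
q <- aggregate(alcohol ~ quality, data = wine,
               FUN = quantile, probs = c(0.1, 0.5, 0.9))
print(q)

# Samples per level, to spot levels with too few points to trust.
n_per_level <- table(wine$quality)
print(n_per_level)
```

Printing the per-level counts next to the quantiles makes it easier to notice when a narrow spread, such as the one seen for quality 9, comes from only a handful of samples.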
## Correlation: -0.009209091
This is the box plot for quality and citric acid. We can see that there is no relationship between quality and citric acid.
## Correlation: 0.09942725
This is the box plot for quality and pH. We can see that, except for the lowest quality group, higher-quality wines have slightly higher pH.
## Correlation: -0.3071233
This is the box plot for quality and density. We can see that wines with higher quality have lower density. That makes sense: density is inversely related to alcohol level, because alcohol is lighter than water, and higher-quality wines tend to have more alcohol.
## Correlation: -0.7801376
This is the scatter plot for density and alcohol. I made this plot to verify that density is inversely related to alcohol level.
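The argument that alcohol is lighter than water can be checked with a back-of-the-envelope calculation. The sketch below rests on several assumptions that are not part of this report: the alcohol variable is taken to be percent by volume, the densities of water and ethanol are approximate textbook values near 20 °C, volumes are treated as additive, and sugar and other solutes are ignored. Real wines contain dissolved sugar, which presumably explains why the observed densities (0.9871 to 1.0390) are higher than this idealised mixture.

```r
# Assumed pure-component densities in g/cm^3 near 20 C (not from the report).
rho_water   <- 0.998
rho_ethanol <- 0.789

# Idealised water/ethanol mixture: additive volumes, no sugar or solutes.
mixture_density <- function(alcohol_pct) {
  v <- alcohol_pct / 100              # volume fraction of ethanol
  v * rho_ethanol + (1 - v) * rho_water
}

alcohol_pct <- c(8, 10, 12, 14)       # spans the alcohol range in the summary
rho <- mixture_density(alcohol_pct)
print(round(rho, 4))
```

Under these assumptions the density falls steadily as the alcohol share rises, which matches the direction of the strong negative correlation above.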
## Correlation: -0.1136628
This is the box plot for quality and fixed acidity level. We can see that there is little to no relationship between quality and fixed acidity.
## Correlation: 0.05367788
This is the box plot for quality and sulphates level. We can see that there is no relationship between quality and sulphates level.
## Correlation: 0.2891807
This is the scatter plot for citric acid and fixed acidity. We can see a slight positive correlation between these two variables, which makes sense since an acid should be related to acidity.
## Correlation: 0.05921725
This is the scatter plot for sulphates and free sulfur dioxide. We can see there is no correlation between the two variables. I’m not very familiar with chemistry, but it seems that sulphates and sulfur dioxide are two independent substances.
I plotted chlorides against alcohol in the figure below, grouped and colored by wine quality.
From this plot we can see that the higher-quality group tends to have a higher alcohol level and a lower chlorides level. I also added the scatter plot of all data points. The variation of chlorides for a given alcohol level is quite large, but the general trend is visible: low-quality wines (red points) tend to have more chlorides than high-quality wines (blue points).
Below is a scatter plot which shows the relationship between density, alcohol and quality.
It is very clear that density decreases as alcohol increases. The red points (quality below 5) are mainly located in the low-alcohol area, and the blue points (quality above 6) are mainly located in the high-alcohol area.
I found that fixed acidity is independent of volatile acidity. I had expected them to be negatively related, since acidity can be either volatile or fixed. We can also see that the lower-quality group tends to have a higher volatile acidity level.
This plot shows the quality of wine versus the alcohol level. It is a scatter plot with alpha=0.5 plus some jittering to visualize the actual distribution of alcohol at each quality level. In addition, the 10%, median and 90% quantile bars and boxplot geoms were added to better visualize the general trend of the data. From the exploration above, alcohol is the feature with the largest correlation (0.435) to wine quality among all the given features. For wine samples of quality 5 or higher, the quality gets better as the median alcohol level grows. However, low-quality wines (3 and 4) also tend to have a higher alcohol level. This observation is very interesting to me; it also indicates that other variables must influence the quality.
In this plot I made a scatter plot of alcohol versus chlorides, colored by wine quality. The plot shows how combining two different features can give a better prediction of wine quality. It is clear that wine with higher alcohol tends to have higher quality. We can also see that the chlorides level influences wine quality: the chlorides level for the higher-quality group (blue) is mostly below 0.05.
I have some reflections on this project:
Having a reasonably sized dataset is important. When there are too few data points, the statistical analysis might be less reliable. For example, there are only 5 samples of quality 9 wine, and a box plot or quantile computed from these 5 samples might not be as robust as one computed from, say, 500 samples.
Understanding the range and distribution of the data is very important. It is usually very helpful to first plot the histogram of each variable in order to get a sense of how it is distributed, and to decide on a reasonable axis scale to present it. Without such a step, the resulting visualization can be very skewed and hard to interpret.
Some unexpected results are not necessarily wrong; they might just be facts that we overlooked before. For example, I expected that, conditioned on wine quality, the curves of one physical/chemical property against another would be distinguishable between quality groups. This, however, is not true: as the analysis showed, those relationships are often governed by physical/chemical laws and therefore do not depend much on human taste.
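The first reflection, about 5 samples versus 500, can be illustrated with a small simulation. This is only a sketch on standard normal draws, not on the wine data, and the helper `median_spread` is a name made up for this example. It measures how much the sample median varies from one random sample to the next.

```r
set.seed(42)

# Standard deviation of the sample median over repeated draws of size n
# from a standard normal distribution.
median_spread <- function(n, reps = 2000) {
  sd(replicate(reps, median(rnorm(n))))
}

spread_5   <- median_spread(5)     # as few samples as quality 9 has
spread_500 <- median_spread(500)   # a comfortably large group

print(c(n5 = spread_5, n500 = spread_500))
```

The median of 5 points fluctuates far more than the median of 500, which is why the box for quality 9 should be read with caution.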
For future exploration of this dataset, a mathematical model (linear or non-linear) could be built to predict the quality. I believe that 11 features and nearly 5,000 samples can lead to a very good model.
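As a starting point for such a model, a plain linear regression could look like the sketch below. It is fitted to synthetic data, so the coefficients say nothing about real wines: the data frame `wine_sim`, the choice of alcohol and chlorides as predictors, and the generating formula are all assumptions for illustration. Since quality is an ordered score, an ordinal model might be a better fit than ordinary least squares.

```r
set.seed(7)
# Synthetic data (NOT the real measurements): quality rises with alcohol
# and falls with chlorides, plus noise.
n <- 500
wine_sim <- data.frame(
  alcohol   = runif(n, 8, 14),
  chlorides = runif(n, 0.01, 0.10)
)
wine_sim$quality <- 2 + 0.4 * wine_sim$alcohol -
  10 * wine_sim$chlorides + rnorm(n, sd = 0.7)

# Ordinary least squares with two predictors.
fit <- lm(quality ~ alcohol + chlorides, data = wine_sim)
coefs <- coef(fit)
r2 <- summary(fit)$r.squared

print(coefs)
print(r2)

# Predict the score for a hypothetical new wine.
new_wine <- data.frame(alcohol = 12, chlorides = 0.04)
pred <- predict(fit, newdata = new_wine)
print(pred)
```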